Academic Tracker: Software for tracking and reporting publications associated with authors and grants

In recent years, United States federal funding agencies, including the National Institutes of Health (NIH) and the National Science Foundation (NSF), have implemented public access policies to make research supported by funding from these federal agencies freely available to the public. Enforcement is primarily through annual and final reports submitted to these funding agencies, where all peer-reviewed publications must be registered through the appropriate mechanism as required by the specific federal funding agency. Unreported and/or incorrectly reported papers can result in delayed acceptance of annual and final reports and even funding delays for current and new research grants. Therefore, every peer-reviewed publication must be reported properly and in a timely manner. For large collaborative research efforts, the tracking and proper registration of peer-reviewed publications along with generation of accurate annual and final reports can create a large administrative burden. With large collaborative teams, it is easy for these administrative tasks to be overlooked, forgotten, or lost in the shuffle. In order to help with this reporting burden, we have developed the Academic Tracker software package, implemented in the Python 3 programming language and supporting Linux, Windows, and Mac operating systems. Academic Tracker helps with publication tracking and reporting by comprehensively searching major peer-reviewed publication tracking web portals, including PubMed, Crossref, ORCID, and Google Scholar, given a list of authors. Academic Tracker provides highly customizable reporting templates so information about the resulting publications is easily transformed into appropriate formats for tracking and reporting purposes.
The source code and extensive documentation are hosted on GitHub (https://moseleybioinformaticslab.github.io/academic_tracker/), and the package is also available on the Python Package Index (https://pypi.org/project/academic_tracker) for easy installation.


Introduction
Since 2008, the United States government has passed laws and issued directives to promote public access to peer-reviewed publications resulting from federal funding. These requirements started with Division G, Title II Section 218 of the Public Law (PL) 110-161, also known as the Consolidated Appropriations Act of 2008 [1], which directed the National Institutes of Health (NIH) to require all peer-reviewed publications supported by NIH funds to be electronically submitted to PubMed [2] within 12 months of the official date of publication [3]. Second, in 2013, the White House Office of Science & Technology Policy (OSTP) mandated that all federal agencies with research and development budgets over $100 million develop public access plans for research publications and data resulting from grants provided by these federal agencies [4]. Shortly thereafter, in 2014, the US Congress passed the FY 2014 Omnibus Appropriations Act [5], which required federal agencies under Labor, Health and Human Services, and Education with research budgets of $100 million or more to provide public online access to peer-reviewed publications within 12 months of the official date of publication [6]. To comply with federal law, both NIH and NSF have implemented public access policies to make research supported by funding from these federal agencies freely available to the public. The enforcement of these policies typically occurs during the annual and final reporting process for grants funded by NIH and NSF. In these reports, all peer-reviewed publications must be registered through the mechanism required by the specific federal funding agency. For NIH, peer-reviewed publications must be registered with PubMed Central and have a PubMed Central ID (PMCID). For NSF, peer-reviewed publications must be submitted to the NSF Public Access Repository (NSF-PAR) via Research.gov in the form of an archival PDF (PDF/A) [7].
Unreported and/or incorrectly reported papers can result in delayed acceptance of annual and final reports and funding delays for current and new research grants. Therefore, timely reporting of every peer-reviewed publication is required. For large collaborative research efforts involving large research teams or even multiple research teams, the tracking and proper registration of peer-reviewed publications along with generation of accurate annual and final reports can create a large administrative burden. With large collaborative teams, it is easy for these administrative tasks to be overlooked, forgotten, or lost in the shuffle.
In an effort to help researchers and their minders stay up-to-date with the reporting of peer-reviewed publications, we created the Academic Tracker software package. Written in the Python 3 programming language, Academic Tracker comprehensively searches major peer-reviewed publication tracking web portals, gathering relevant publications and useful tracking characteristics, for example, an indication of whether the publication has been reported to the NIH (is on PubMed), needs to be reported (is associated with an NIH grant), or satisfies the NIH's requirements to have a PMCID. It has the ability to search PubMed [2], ORCID [8], Google Scholar [9], and Crossref [10], given a list of authors and/or author IDs. Academic Tracker provides highly customizable reporting templates so information about the resulting publications is easily transformed into appropriate formats for tracking and reporting purposes.
ORCID (Open Researcher and Contributor ID) is a non-profit organization dedicated to uniquely identifying individuals who participate in research [8]. Once an author is registered, ORCID provides a unique ID that can be used to associate an author with their publications. These associations can be easily accessed from the ORCID website or through its application programming interface (API). Google Scholar is a search engine for scholarly literature with API search facilities similar to those of ORCID [9]. Authors can create profiles on Google Scholar, which Google Scholar uses to automatically associate publications with them. Crossref is a non-profit association with both commercial and non-profit publisher members with a primary purpose of enabling cross-publisher citation linking [10]. Crossref's stated goal is to make "research objects easy to find, cite, link, assess, and reuse." For the purposes of Academic Tracker, Crossref serves as a database with an easily accessible API for finding relevant publications.
P42 ES007380 (PI Pennell; co-I HNBM) via the Data Management and Analysis Core (DMAC), and NIH U54 TR001998-05A1 (PI Kern; co-I HNBM). There was no additional external funding received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Academic Tracker has three main use-cases and one supportive use-case. The first main use-case searches the aforementioned web portals for publications, given a list of authors. The second main use-case searches PubMed and Crossref for publication information, given a list of publication citations. Neither ORCID nor Google Scholar can be searched for specific publication information directly. ORCID is organized around author profiles rather than publications and does not provide an option to search by publication characteristics. Google Scholar cannot be searched by specific publication characteristics, because Google Scholar has limited the repetitive programmatic use of its web service in this way. However, Google Scholar does allow repetitive programmatic search by author profile ID. The third main use-case finds collaborators, given a list of authors. This is similar to the first use-case, but focuses on compiling the co-authors from the publications rather than the publications themselves. The fourth, supportive use-case searches ORCID or Google Scholar for authors' unique IDs for these sources, given a list of authors.
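The collaborator use-case can be pictured with a short sketch. The record layout used here (publication dictionaries holding an "authors" list of first and last names) and the function name compile_collaborators are assumptions made for illustration only; they are not Academic Tracker's actual data model or API, and the names are dummy data.

```python
from collections import Counter


def compile_collaborators(publications, target_author):
    """Count the co-authors of target_author across publication records.

    Hypothetical record layout: {"title": str, "authors": [(first, last), ...]}.
    """
    target = tuple(part.lower() for part in target_author)
    counts = Counter()
    for pub in publications:
        names = [(first.lower(), last.lower()) for first, last in pub["authors"]]
        if target not in names:
            continue  # the target author is not on this publication
        for name in names:
            if name != target:
                counts[name] += 1
    return counts


# Dummy publication records for illustration.
publications = [
    {"title": "Paper A", "authors": [("Jane", "Doe"), ("John", "Smith")]},
    {"title": "Paper B", "authors": [("Jane", "Doe"), ("John", "Smith"), ("Ann", "Lee")]},
    {"title": "Paper C", "authors": [("Ann", "Lee"), ("Bob", "Ray")]},
]
collaborators = compile_collaborators(publications, ("Jane", "Doe"))
```

Counting how often each co-author appears, rather than only listing names, is a design choice of this sketch; a real collaborator report may carry other attributes such as affiliations.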
The main output from the three main use-cases is a JavaScript Object Notation (JSON) file containing information about each publication found. Other outputs vary depending on user settings. Customizable summary and project reports can be created, with an option of emailing them as attachments. The collaborator report of the third use-case is also customizable. All emails are also copied into a JSON file. A configuration JSON file is needed as part of the input to Academic Tracker, and the fourth, supportive use-case will update this file with the information found during the search. A use-case diagram for Academic Tracker is shown in Fig 1.

Fig 1. Academic Tracker use-case diagram. The first and third use-cases, publication search and collaborator search, are illustrated via the "Publication Search by Author" option. The second use-case, publication information, is illustrated via the "Publication Search by Reference" option. The supporting use-case, covering ORCID ID and Google Scholar ID searches, is illustrated by the "Unique ID Search" option.

Third-party packages
Academic Tracker leverages many third-party Python libraries and packages to accomplish its major tasks. Academic Tracker uses the docopt library to implement a command line interface (CLI) from a Python docstring description. Next, Academic Tracker uses the jsonschema library to validate user JSON input against an expected schema, which is also in JSON format. JSON Schema is an independently developed vocabulary or framework created for the purpose of validating and annotating JSON. Other developers have implemented the vocabulary in several languages, and the jsonschema library is the Python language implementation. The specific schemas used in Academic Tracker are in the Validation_Schemas directory of the supplemental materials. Academic Tracker also uses four different packages to query data sources for publications. Specifically, Academic Tracker uses the pymed, habanero, orcid, and scholarly libraries to query PubMed, Crossref, ORCID, and Google Scholar, respectively. For the second use-case, Academic Tracker uses the requests library to make HTTP requests and the beautifulsoup4 library to parse HTML in the pulled web pages given as the reference file. Next, Academic Tracker uses the fuzzywuzzy library to fuzzy match publication titles, which is necessary because publications do not have a universal unique identifier. For general file input/output, Academic Tracker uses several packages, including: i) the python-docx library to read Microsoft Word files, specifically for the reference file input; ii) the pandas library to read and write tabular data, specifically to read in author data and write out reports; and iii) indirectly the openpyxl library, which is used by pandas to write Excel files. In order to comprehensively compare publication information across different runs to see if any information has changed, Academic Tracker uses the deepdiff library. A list of packages and their versions is in Table 1.
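The idea behind fuzzy matching of publication titles can be sketched as follows. This sketch uses difflib.SequenceMatcher from the Python standard library as a stand-in for fuzzywuzzy, so it is not the package's actual matching code; the normalization step, the function names, and the 0.9 threshold are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def titles_match(title_a, title_b, threshold=0.9):
    """Return True when two titles are similar enough to be treated as one publication.

    The threshold is an arbitrary value for this sketch, not a setting from the paper.
    """
    ratio = SequenceMatcher(None, normalize_title(title_a), normalize_title(title_b)).ratio()
    return ratio >= threshold


# The same title as it might be rendered by two different web portals.
same = titles_match(
    "Academic Tracker: Software for tracking and reporting publications",
    "Academic tracker - software for tracking and reporting publications.",
)
# Two unrelated titles.
different = titles_match(
    "Academic Tracker: Software for tracking and reporting publications",
    "A survey of citation linking services",
)
```

Normalizing case and punctuation before scoring keeps purely typographic differences between portals from lowering the similarity ratio.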

Use cases
Although there are 3 main use-cases and 1 supportive use-case, Academic Tracker has 2 main commands and 6 supporting commands (Table 2). The first and third main use-cases are handled by the author_search command, while the second main use-case is handled by the reference_search command. The supportive use-case is handled by the find_ORCID and find_Google_Scholar commands. The remaining four commands help users experiment with the tokenization and reporting systems in Academic Tracker and make it easier to convert author information into JSON format. The input and output files for each command are further described in Table 3.

Module description
Although Academic Tracker is primarily designed to be a command line tool, it does provide an equivalent API, which can be utilized if so desired. The CLI and highest-level API for each command are implemented in the __main__.py file, but other submodules break down the steps into smaller pieces. Within the API, reference_search and author_search are almost completely separated into their own submodules. The athr_srch_modularized.py submodule compartmentalizes the steps of author_search, while the athr_srch_webio.py and athr_srch_emails_and_reports.py submodules contain the functions to interface with the internet and generate reports and emails, respectively. reference_search is organized the same way with the ref_srch_modularized.py, ref_srch_webio.py, and ref_srch_emails_and_reports.py submodules. The user_input_checking.py submodule contains the functions to validate user input for errors, and the tracker_schema.py submodule works in tandem with it to store the JSON schema being used for validation. The fileio.py submodule contains all the functions for reading and writing files. The webio.py submodule contains functions to interface with the internet that are more general purpose or common to multiple commands. It also holds the functions that interface with the internet for find_ORCID and find_Google_Scholar. The helper_functions.py submodule contains functions with common operations across all commands that don't classify well into any other submodule, such as regex operations and data transformation. The citation_parsing.py submodule contains all the functions used to tokenize the reference sources for reference_search. Table 4 lists the submodules of Academic Tracker, and Fig 2 shows a module diagram.

Testing
The Academic Tracker package was originally developed in a Linux operating system (OS) environment, but has been directly tested on Linux, Windows, and macOS operating systems. All use-cases have been tested on these operating systems; however, Academic Tracker relies on sendmail or an emulator being installed and configured on the machine for its email functionality. In addition, each submodule includes unit tests that cover all critical functions of the submodule. Every function in every module is tested to make sure it returns the expected output for valid input and raises an error for invalid input. All requests to web portals are replaced with mock data. The user_input_checking.py submodule has the largest number of tests, since it tests several error states for each element of the input JSON files. Every command line option is tested, for example, the silent option and the option to skip searching ORCID. Various ways of creating reports are also tested, such as creating a tabular report versus a text report, Excel versus CSV format, and renaming the report from the default name. Several different citation styles and sources are also tested to make sure they are tokenized correctly, such as MEDLINE, a MyNCBI bibliography URL, and an NSF Award page.
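Replacing web portal requests with mock data can be illustrated with a minimal sketch based on unittest.mock. The fetch_titles function, the JSON payload, and the example.org URL are hypothetical stand-ins written for this sketch; they are not taken from the Academic Tracker test suite, and urllib is used here only to keep the example within the standard library.

```python
import json
import urllib.request
from unittest import mock


def fetch_titles(url):
    """Download a JSON list of publication records and return their titles."""
    with urllib.request.urlopen(url) as response:
        records = json.loads(response.read().decode("utf-8"))
    return [record["title"] for record in records]


def run_with_mock_data():
    """Call fetch_titles with urlopen patched so that no network request is made."""
    payload = json.dumps([{"title": "Mock Paper 1"}, {"title": "Mock Paper 2"}]).encode("utf-8")
    fake_response = mock.MagicMock()
    fake_response.read.return_value = payload
    # urlopen is used as a context manager, so __enter__ must yield the fake response.
    fake_response.__enter__.return_value = fake_response
    with mock.patch("urllib.request.urlopen", return_value=fake_response) as fake_urlopen:
        titles = fetch_titles("https://example.org/publications")  # placeholder URL
    return titles, fake_urlopen.call_count


titles, calls = run_with_mock_data()
```

Because the patched function returns canned data, a test written this way is deterministic and does not depend on the availability or rate limits of any web portal.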

Results
Academic Tracker was designed with a great deal of flexibility and can be utilized in many different ways, including ways we did not anticipate. However, only the three main use-cases and the one supportive use-case are presented here. Note that the figures here are general examples with mostly dummy data. There are full examples with real data and run commands in the supplemental materials (Example_Runs subdirectory). The first main use-case involves searching for publications given author information. Fig 3 shows an example input configuration JSON file, the command line for its execution, the API execution equivalent, and the resulting output files. Fig 4 shows the contents of these resulting output files. Authors without unique ORCID or Google Scholar IDs are identified by matching first name, last name, and at least one affiliation. The second main use-case involves looking for publications based on a given reference. Fig 5 shows an example input configuration JSON file, the command line for its execution, the API execution equivalent, and the resulting output files. Figs 6 and 7 show the contents of the resulting output files.
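The rule for identifying authors without unique IDs (first name, last name, and at least one affiliation must match) can be sketched as below. The dictionary keys, the function name, and the use of case-insensitive substring matching for affiliations are assumptions of this sketch rather than the package's documented behavior, and the records are dummy data.

```python
def author_matches(candidate, author):
    """Return True when first name, last name, and at least one affiliation agree.

    candidate: an author record as listed on a publication (hypothetical layout).
    author: a configured author with a list of acceptable affiliation strings.
    """
    if candidate["first_name"].lower() != author["first_name"].lower():
        return False
    if candidate["last_name"].lower() != author["last_name"].lower():
        return False
    listed_affiliation = candidate["affiliation"].lower()
    # An affiliation matches when the configured string occurs in the listed affiliation.
    return any(aff.lower() in listed_affiliation for aff in author["affiliations"])


# Dummy configured author and two candidate records from publications.
configured = {"first_name": "Jane", "last_name": "Doe", "affiliations": ["kentucky"]}
hit = author_matches(
    {"first_name": "Jane", "last_name": "Doe",
     "affiliation": "University of Kentucky, Lexington, KY"},
    configured,
)
miss = author_matches(
    {"first_name": "Jane", "last_name": "Doe",
     "affiliation": "Example State University"},
    configured,
)
```

Requiring an affiliation in addition to the name reduces false matches between different people who share a common name.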
The third use-case is essentially identical to the first, but a collaborator report attribute needs to be added to an author. [...] contents of the publications JSON and summary report. Table 5 shows the contents of the resulting collaborator report table.
The supportive use-case is broken into 2 commands: find_ORCID for finding ORCID IDs and find_Google_Scholar for finding Google Scholar IDs. Fig 9 shows an example input configuration JSON, how to accomplish this using the command line and API, and the resulting output files for finding ORCID IDs. Fig 10 shows

Discussion and conclusions
Academic Tracker is a useful tool for querying major scientific publication web portals for publications, given a list of authors or references, and for creating highly customizable reports from the list of publications found. The software package provides assistance in repetitive tracking and reporting of peer-reviewed publications associated with specific authors, projects, and grants. Specifically, the JSON configuration file supports batch execution, directing Academic Tracker to perform multiple related author searches and report generations. The JSON configuration file has many optional parameters to customize searching and report generation, including a cutoff_year for searching. Academic Tracker is also designed for repetitive tracking by comparing current search results to prior search results to limit reporting to newly detected publications and changes in publication attributes. Academic Tracker also provides facilities for generating lists of co-author collaborators, which has several uses in grant proposal submission. However, given the number of major use-cases and the versatility of the software, some intellectual overhead is required to initially set up the JSON configuration file and customize reports. Additional supportive commands are included to make learning and troubleshooting the tool easier for new users. Also, there is extensive documentation available to help with the learning curve: https://moseleybioinformaticslab.github.io/academic_tracker/. In addition, when installed via the Python package management system pip, a console script "academic_tracker" is created automatically for the user, providing easy access to the CLI.
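Comparing a current search against a prior one can be sketched with plain dictionaries. The paper states that Academic Tracker uses the deepdiff library for this comparison; the simplified standard-library version below is only an illustration, and the function name, the record layout, and the identifiers (including the DOI-like keys and the PMCID value) are placeholders invented for the sketch.

```python
def diff_publications(previous, current):
    """Compare two {publication_id: attributes} mappings from successive runs.

    Returns the sorted IDs that are new in the current run, plus a mapping of
    changed attributes for publications present in both runs.
    """
    new_ids = sorted(set(current) - set(previous))
    changed = {}
    for pub_id in set(current) & set(previous):
        old, new = previous[pub_id], current[pub_id]
        # Record each attribute whose value differs, as (old value, new value).
        delta = {
            key: (old.get(key), new.get(key))
            for key in set(old) | set(new)
            if old.get(key) != new.get(key)
        }
        if delta:
            changed[pub_id] = delta
    return new_ids, changed


# Placeholder records: one publication gains a PMCID and one publication is new.
previous = {"doi:10.1000/example.1": {"title": "Paper A", "PMCID": None}}
current = {
    "doi:10.1000/example.1": {"title": "Paper A", "PMCID": "PMC0000000"},
    "doi:10.1000/example.2": {"title": "Paper B", "PMCID": None},
}
new_ids, changed = diff_publications(previous, current)
```

Reporting only the new publications and changed attributes keeps repeated runs short, which matches the stated goal of limiting reports to what has changed since the prior search.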
While the package accesses multiple major peer-reviewed publication tracking web portals, it is fundamentally limited to the information provided by these web portals and must assume the information provided is accurate. One possibility for verifying this information is to download a PDF of the publication itself for analysis. However, this is pragmatically infeasible, since there is wide variation in how journals organize the splash page of their publications. One way to alleviate this issue is for journals to adopt a DOI extension like ".pdf" which would link directly to the PDF version of the publication, if the PDF version is accessible. This is similar to the versioning ".v#" DOI extension that FigShare uses to provide access to each version of a public FigShare repository. If a practical way to directly access the PDF is implemented either by journals or the publication tracking web portals, we would extend Academic Tracker to utilize it. Still, in its current implementation, we believe Academic Tracker can significantly reduce the stress and hassle of reporting publications to federal funding agencies, reducing the chance of accidental noncompliance and resulting delays in funding.

PLOS ONE